Chukwa: A System for Reliable Large-Scale Log Collection
Abstract
Large Internet services companies like Google, Yahoo, and Facebook use the MapReduce programming model to process log data. MapReduce is designed to work on data stored in a distributed filesystem like Hadoop’s HDFS. As a result, a number of log collection systems have been built to copy data into HDFS. These systems often lack a unified approach to failure handling, with errors being handled separately by each piece of the collection, transport, and processing pipeline. We argue instead for a unified approach and present Chukwa, a system that embodies it. Chukwa uses an end-to-end delivery model that can leverage local on-disk log files for reliability. This approach also eases integration with legacy systems. The architecture offers a choice of delivery models, making subsets of the collected data available promptly for clients that require them, while reliably storing a copy in HDFS. We demonstrate that our system works correctly on a 200-node testbed and can collect in excess of 200 MB/sec of log data. We supplement these measurements with a set of case studies describing real-world operational experience at several sites.
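The idea of end-to-end delivery backed by a local on-disk log file can be illustrated with a minimal sketch. This is not Chukwa's actual protocol or API: the names `FlakyCollector`, `store`, and `ship_log`, the chunk size, and the simulated failure pattern are all hypothetical, chosen only to show a sender that advances its checkpoint after an acknowledgement and otherwise re-reads the same bytes from the local file.

```python
import os
import tempfile


class FlakyCollector:
    """Hypothetical collector that rejects every second call to simulate failures."""

    def __init__(self):
        self.stored = bytearray()
        self.calls = 0

    def store(self, offset, data):
        """Return True (an acknowledgement) only if the chunk is stored."""
        self.calls += 1
        if self.calls % 2 == 0:
            return False  # simulated failure: nothing stored, no acknowledgement
        if offset == len(self.stored):
            self.stored.extend(data)  # new data: append it
        return True  # chunks at an already-stored offset are acknowledged, not re-appended


def ship_log(path, collector, checkpoint=0, chunk_size=16, max_attempts=100):
    """Send the file from `checkpoint` onward, advancing only on acknowledgement."""
    with open(path, "rb") as log:
        for _ in range(max_attempts):
            log.seek(checkpoint)
            data = log.read(chunk_size)
            if not data:
                break  # reached the end of the local log file
            if collector.store(checkpoint, data):
                checkpoint += len(data)
            # On failure the checkpoint is unchanged, so the same bytes
            # are re-read from the local file on the next attempt.
    return checkpoint


def demo():
    """Write a small log file, ship it through the flaky collector, return results."""
    payload = b"".join(b"event %d\n" % i for i in range(20))
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "app.log")
        with open(path, "wb") as f:
            f.write(payload)
        collector = FlakyCollector()
        checkpoint = ship_log(path, collector)
    return payload, bytes(collector.stored), checkpoint
```

Because the log file on disk is the source of truth, the sender needs no in-memory retry buffer: persisting `checkpoint` would be enough to resume after a restart, which is the property the sketch is meant to convey.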
Similar resources
A Search Log-Based Approach to Evaluation
Anyone offering content in a digital library is naturally interested in assessing its performance: how well does my system meet the users’ information needs? Standard evaluation benchmarks have been developed in information retrieval that can be used to test retrieval effectiveness. However, these generic benchmarks focus on a single document genre, language, media-type, and searcher stereotype...
A Reliable and Economically Feasible Automatic Meter Reading System Using Power Line Distribution Network (TECHNICAL NOTE)
Automatic Meter Reading (AMR) is the remote collection of consumption data from customers’ utility meters over telecommunications, radio, power line and other links. AMR provides water, electric and gas utility-service companies the opportunities to streamline metering, billing and collection activities, increase operational efficiency and improve customer service. Utility company uses technolo...
A New Method for Improving Computational Cost of Open Information Extraction Systems Using Log-Linear Model
Information extraction (IE) is a process of automatically providing a structured representation from an unstructured or semi-structured text. It is a long-standing challenge in natural language processing (NLP) which has been intensified by the increased volume of information and heterogeneity, and non-structured form of it. One of the core information extraction tasks is relation extraction wh...
Design of a Log Server for Distributed and Large-Scale Server Environments
Collection, storage and analysis of multiple hosts’ audit trails in a distributed manner are known as a major requirement, as well as a major challenge, for enterprise-scale computing environments. To ease these tasks, and to provide a central management facility, a software suite named “LogHunter” has been developed. LogHunter is a secure distributed log server system which involves log col...
An Ant Colony Optimization Algorithm for Network Vulnerability Analysis
Intruders often combine exploits against multiple vulnerabilities in order to break into the system. Each attack scenario is a sequence of exploits launched by an intruder that leads to an undesirable state such as access to a database, service disruption, etc. The collection of possible attack scenarios in a computer network can be represented by a directed graph, called network attack gra...